
    Conditional heavy hitters : detecting interesting correlations in data streams

    The notion of heavy hitters—items that make up a large fraction of the population—has been successfully used in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: items that are frequent within the context of their parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items, with applications in network monitoring and Markov chain modeling. We explore the relationship between conditional heavy hitters and other related notions in the literature, and show analytically and experimentally the usefulness of our approach. We introduce several algorithm variations that allow us to efficiently find conditional heavy hitters for input data with very different characteristics, and provide analytical results for their performance. Finally, we perform experimental evaluations on several synthetic and real datasets to demonstrate the efficacy of our methods and to study the behavior of the proposed algorithms for different types of data.
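    As a minimal illustration of the idea (not the paper's streaming algorithms, which are approximate and memory-bounded), the following exact sketch counts (parent, child) pairs and reports children whose conditional frequency given their parent exceeds a threshold; the names `phi` and `min_parent` are illustrative, not the paper's notation.

```python
from collections import Counter

def conditional_heavy_hitters(stream, phi=0.5, min_parent=3):
    """Exact (non-streaming) illustration: report (parent, child) pairs whose
    conditional frequency given the parent exceeds phi, for parents seen at
    least min_parent times."""
    pair_counts = Counter()
    parent_counts = Counter()
    for parent, child in stream:
        pair_counts[(parent, child)] += 1
        parent_counts[parent] += 1
    return {
        (p, c): n / parent_counts[p]
        for (p, c), n in pair_counts.items()
        if parent_counts[p] >= min_parent and n / parent_counts[p] > phi
    }

stream = [("a", "x"), ("a", "x"), ("a", "y"), ("a", "x"),
          ("b", "z"), ("b", "z"), ("b", "z")]
hh = conditional_heavy_hitters(stream, phi=0.5, min_parent=3)
# ("a", "x") occurs in 3 of 4 "a" events; ("b", "z") in 3 of 3 "b" events
```

    A streaming variant would replace the exact counters with bounded-memory summaries, which is where the approximation guarantees of the paper come in.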

    Evaluating pre-trained Sentence-BERT with class embeddings in active learning for multi-label text classification

    The Transformer language model is a powerful tool that has been shown to excel at various NLP tasks and has become the de facto standard solution thanks to its versatility. In this study, we employ pre-trained document embeddings in an Active Learning task on a legal document corpus, aiming to group samples with the same labels in the embedding space. We find that the computed class embeddings are not close to their respective samples and consequently do not partition the embedding space in a meaningful way. In addition, we explore using the class embeddings as an Active Learning strategy, which yields dramatically worse results than all baselines.
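    A minimal sketch of the class-embedding idea described above, assuming a class embedding is simply the mean of the embeddings of samples carrying that label; the toy 2-d vectors stand in for Sentence-BERT outputs, and all names are illustrative:

```python
import numpy as np

def class_embeddings(X, Y):
    """Mean embedding per label; X is an (n, d) array of document
    embeddings, Y a list of label sets (multi-label)."""
    labels = sorted({l for ys in Y for l in ys})
    return {l: X[[l in ys for ys in Y]].mean(axis=0) for l in labels}

def nearest_class(x, cls_emb):
    """Assign the label whose class embedding is most cosine-similar to x."""
    sims = {l: float(x @ e) / (np.linalg.norm(x) * np.linalg.norm(e))
            for l, e in cls_emb.items()}
    return max(sims, key=sims.get)

# Toy stand-ins for pre-trained document embeddings.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Y = [{"A"}, {"A"}, {"B"}, {"B"}]
cls = class_embeddings(X, Y)
pred = nearest_class(np.array([1.0, 0.2]), cls)  # closest to class "A"
```

    The study's negative result amounts to the real-data analogue of this picture failing: class means computed this way did not sit near their member samples in the Sentence-BERT space.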

    Characterizing Home Device Usage From Wireless Traffic Time Series

    The analysis of temporal behavioral patterns of home network users can reveal important information to Internet Service Providers (ISPs) and help them optimize their networks and offer new services (e.g., remote software upgrades, troubleshooting, energy savings). This study uses time series analysis of continuous traffic data from wireless home networks to extract traffic patterns recurring within, or across, homes, and to assess the impact of different device types (fixed or portable) on home traffic. Traditional techniques for time series analysis are not suited to this task, due to the limited stationarity and evolving distribution properties of wireless home traffic data. We propose a novel framework that relies on a correlation-based similarity measure of time series, as well as a notion of strong stationarity to define motifs and dominant devices. Using this framework, we analyze the wireless traffic collected from 196 home gateways over two months. The proposed approach goes beyond existing application-specific analysis techniques, such as analysis of wireless traffic, which mainly rely on data aggregated across hundreds or thousands of users. Our framework enables the extraction of recurring patterns from the traffic time series of individual homes, leading to a much more fine-grained analysis of user behavior patterns. We also determine the best time aggregation policy w.r.t. the number and statistical importance of the extracted motifs, as well as the device types dominating these motifs and the overall gateway traffic. Our results show that ISPs can go beyond simple observation of aggregated gateway traffic and better understand their networks.
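    One plausible reading of the correlation-based similarity measure mentioned above is a Pearson-correlation distance between equal-length traffic series; the sketch below follows that assumption and is not necessarily the paper's exact measure:

```python
import numpy as np

def corr_distance(x, y):
    """Correlation-based distance between two equal-length time series:
    1 - Pearson correlation, ranging over [0, 2]; 0 means the series
    move together perfectly, 2 means perfect anti-correlation."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - float(xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

d_same = corr_distance([1, 2, 3, 4], [2, 4, 6, 8])   # same shape: ~0
d_anti = corr_distance([1, 2, 3, 4], [4, 3, 2, 1])   # mirrored shape: ~2
```

    A distance of this form is scale- and offset-invariant, which is useful when comparing gateways whose absolute traffic volumes differ by orders of magnitude.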

    Mining and Learning in Sequential Data Streams: Interesting Correlations and Classification in Noisy Settings

    Sequential data streams describe a variety of real-life processes: from sensor readings of natural phenomena to robotics, moving trajectories, and network monitoring scenarios. An item in a sequential data stream often depends on its previous values, so that subsequent items are strongly correlated. In this thesis we address the problem of extracting the most significant sequential patterns from a data stream, with applications to real-time data summarization and classification and to estimating generative models of the data. The first contribution of this thesis is the notion of Conditional Heavy Hitters, which describes items that are frequent conditionally – that is, within the context of their parent item. Conditional Heavy Hitters are useful in a variety of applications, including sensor monitoring, network analysis, Markov chain modeling, and more. We develop algorithms for the efficient detection of Conditional Heavy Hitters depending on the characteristics of the data, and provide analytical quality guarantees for their performance. We also study the behavior of the proposed algorithms for different types of data and demonstrate the efficacy of our methods by experimental evaluation on several synthetic and real-world datasets. The second contribution of the thesis is the extension of Conditional Heavy Hitters to patterns of variable order, which we formalize in the notion of Variable Order Conditional Heavy Hitters. The significance of variable-order patterns is measured in terms of high conditional and joint probability, and in terms of the statistical significance of their difference from the independent case. The approximate online solution in the variable-order case exploits lossless compression approaches. Facing the tradeoff between memory usage and accuracy of pattern extraction, we introduce several online space pruning strategies and study their quality guarantees. The strategies can be chosen depending on the estimation objectives, such as maximizing the precision or recall of the extracted significant patterns. The efficiency of our approach is experimentally evaluated on three real datasets. The last contribution of the thesis concerns the prediction quality of classical and sequential classification algorithms under varying levels of label noise. We present the "Sigmoid Rule" Framework, which allows choosing the most appropriate learning algorithm depending on the properties of the data. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. Based on the sigmoid parameters, we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, hinting at how the inherent properties of a dataset affect learning. The framework is applicable to concept drift scenarios, including modeling user behavior over time and mining noisy, evolving time series.
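    The sigmoid performance model underlying the "Sigmoid Rule" Framework can be sketched as follows; the parameter names (floor m, ceiling M, center c, slope s) are illustrative and not necessarily the thesis's notation:

```python
import math

def expected_performance(snr, m=0.5, M=0.95, c=1.0, s=4.0):
    """Expected learner performance as a sigmoid of the signal-to-noise
    ratio: rises from a floor m to a ceiling M, centered at c with slope s."""
    return m + (M - m) / (1.0 + math.exp(-s * (snr - c)))

p_mid = expected_performance(1.0)  # at the center: (m + M) / 2 = 0.725
p_hi = expected_performance(3.0)   # approaches the ceiling M from below
```

    Comparing fitted parameters across learners is what makes the rule actionable: a higher ceiling M or a steeper slope s indicates an algorithm that copes better as label noise decreases.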

    Reviewing peer review: a quantitative analysis of peer review

    In this paper we focus on the analysis of peer reviews and reviewers' behavior in a number of different review processes. More specifically, we report on the development, definition, and rationale of a theoretical model of peer review processes that supports the identification of appropriate metrics for assessing the processes' main properties. We then apply the proposed model and analysis framework to data sets from conference evaluation processes, and we discuss the implications of the results and their possible use toward improving the analyzed peer review processes. A number of unexpected results were found, in particular: (1) the low correlation between peer review outcome and the impact over time of the accepted contributions, and (2) the presence of a high level of randomness in the analyzed peer review processes.
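    The outcome-vs-impact correlation in finding (1) is typically computed as a rank correlation; this sketch evaluates Spearman's rho on synthetic review scores and citation counts (the data are made up purely for illustration and do not reproduce the paper's numbers):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of the ranks
    (assumes no tied values, which argsort-based ranking cannot handle)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rxc, ryc = rx - rx.mean(), ry - ry.mean()
    return float(rxc @ ryc) / (np.linalg.norm(rxc) * np.linalg.norm(ryc))

# Synthetic example: review scores vs. later citation counts.
scores = [3.0, 1.0, 4.0, 2.0]
citations = [12, 3, 40, 7]
rho = spearman(scores, citations)  # perfectly monotone here, so rho = 1.0
```

    The paper's finding is that on real conference data this correlation comes out low, i.e., review scores are a weak predictor of later citation impact.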